Abstract: We study the problem of power allocation over two
identical Gilbert-Elliott communication channels. Our goal is to
maximize the expected discounted number of bits transmitted 
over an infinite time horizon. This is achieved by choosing 
among three possible strategies: (1) betting on channel 1 by 
allocating all the power to this channel, which results in high 
data rate if channel 1 happens to be in good state, and zero 
bits transmitted if channel 1 is in bad state (even if channel
2 is in good state); (2) betting on channel 2 by allocating all
the power to the second channel, and (3) a balanced strategy 
whereby each channel is allocated half the total power, with 
the effect that each channel can transmit a low data rate if it 
is in good state. We assume that each channel's state is only 
revealed upon transmission of data on that channel. We model 
this problem as a partially observable Markov decision process
(POMDP), and derive key threshold properties of the optimal policy.
Further, we show that by formulating and solving a relevant 
linear program the thresholds can be determined numerically 
when system parameters are known. 

I. Introduction 

Adaptive power control is an important technique to select
the transmission power of a wireless system according to
channel condition to achieve better network performance in
terms of higher data rate or spectrum efficiency [1], [2]. While
there has been some recent work on power allocation over
stochastic channels [3], [4], [5], the problem of optimal adaptive
power allocation across multiple stochastic channels with
memory is challenging and poorly understood. In this paper,
we analyze a simple but fundamental problem. We consider
a wireless system operating on two stochastically identical
independent parallel transmission channels, each modeled as
a slotted Gilbert-Elliott channel (i.e., described by a two-state
Markov chain, with a bad state "0" and a good state "1").
Our objective is to allocate the limited power budget to the
two channels dynamically so as to maximize the expected
discounted number of bits transmitted over time. Since the
channel state is unknown when the power allocation decision is
made, this problem is more challenging than it first appears.

Recently, several works have explored different sequential
decision-making problems involving Gilbert-Elliott channels
[6], [7], [8], [9], [10]. In [6], [7], the authors consider
selecting one channel to sense/access at each time among
several identical channels, formulate it as a restless multi-armed
bandit problem, and show that a simple myopic policy is
optimal whenever the channels are positively correlated over
time. In [8], the authors study the problem of dynamically
choosing one of three transmitting schemes for a single
Gilbert-Elliott channel in an attempt to maximize the expected
discounted number of bits transmitted. In [9], the authors
study the problem of choosing a transmitting strategy from
two choices, emphasizing the case when the channel transition
probabilities are unknown. While similar in spirit to these
two studies, our work addresses a more challenging setting
involving two independent channels. A more closely related
two-channel problem is studied in [10], which characterizes the
optimal policy to opportunistically access two non-identical
Gilbert-Elliott channels (generalizing the prior work on sensing
policies for identical channels [6], [7]). While we address only
identical channels in this work, the strategy space explored
here is richer because in our formulation of power allocation,
it is possible to use both channels simultaneously, whereas in
[6], [7], [10] only one channel is accessed in each time slot.
In this paper, we formulate our power allocation problem as
a partially observable Markov decision process (POMDP). We
then treat the POMDP as a continuous state MDP and develop
the structure of the optimal policy (decision). Our main
contributions are the following: (1) we formulate the problem
of dynamic power allocation over parallel Markovian channels,
(2) using MDP theory, we prove key threshold
properties of the optimal policy for this particular problem,
(3) through simulation based on linear programming, we
demonstrate the existence of the 0-threshold and 2-threshold
structures of the optimal policy, and (4) we demonstrate how to
numerically compute the thresholds and construct the optimal
policy when system parameters are known.

II. Problem Formulation 

A. Channel model and assumptions 

We consider a wireless communication system operating on
two parallel channels. Each channel is described by a slotted
Gilbert-Elliott model, which is a one-dimensional Markov
chain $G_{i,t}$ ($i \in \{1,2\}$, $t \in \{1, 2, \ldots, \infty\}$) with two states: a
good state denoted by 1 and a bad state denoted by 0 ($i$ is the
channel number and $t$ is the time slot). The channel transition
probabilities are given by
$\Pr[G_{i,t} = 1 \mid G_{i,t-1} = 1] = \lambda_1$, $i \in \{1,2\}$, and
$\Pr[G_{i,t} = 1 \mid G_{i,t-1} = 0] = \lambda_0$, $i \in \{1,2\}$. We
assume the two channels are identical and independent of each
other, and channel transitions occur at the beginning of each
time slot. We also assume that $\lambda_0 < \lambda_1$, which is the positive
correlation assumption commonly used in the literature.

The system has a total transmission power of $P$. At the
beginning of time slot $t$, the system allocates transmission
power $P_1(t)$ to channel 1 and $P_2(t)$ to channel 2, where
$P_1(t) + P_2(t) = P$. We assume the channel state is not
directly observable at the beginning of each time slot. That
is, the system needs to allocate the transmission power to
the two parallel channels without knowing the channel states.
If channel $i$ ($i \in \{1,2\}$) is used at time slot $t$ by allocating
transmission power $P_i(t)$ to it, the channel state of the elapsed
slot is revealed at the end of the time slot through channel
feedback. But if a channel is not used, that is, if transmission
power 0 is allocated to that channel, the channel state of the elapsed
slot remains unknown at the end of that slot.

B. Power allocation strategies 

To simplify the problem, we assume the system may allocate
one of the following three power levels to a channel: 0, $P/2$, or
$P$. That is, based on the belief in the channel state of channel
$i$ for the current time slot $t$, the system may decide to give up
the channel ($P_i(t) = 0$), use it moderately ($P_i(t) = P/2$), or
use it fully ($P_i(t) = P$). Since the channel state is not directly
observable when the power allocation is done, the following
circumstances may occur. If a channel is in bad state, no data
is transmitted at all no matter what the allocated power is. If
a channel is in good state and power $P/2$ is allocated to it, it
can transmit $R_l$ bits of data successfully during that slot. If a
channel is in good state and power $P$ is allocated to it, it
can transmit $R_h$ bits of data successfully during that slot. We
assume $R_l < R_h < 2R_l$.

We define three power allocation strategies (actions): balanced,
betting on channel 1, and betting on channel 2. Each
strategy is explained in detail as follows.

Balanced: For this action (denoted by $B_b$), the system
allocates the transmission power evenly on both channels, that
is, $P_1(t) = P_2(t) = P/2$, for time slot $t$. This corresponds to
the situation when the system cannot determine which of the
channels is more likely to be in good state, so it decides to
"play safe" by using both of the channels.

Betting on channel 1: For this action (denoted by $B_1$), the
system decides to "gamble" and allocate all the transmission
power to channel 1. That is, $P_1(t) = P$, $P_2(t) = 0$ for time slot
$t$. This corresponds to the situation when the system believes
that channel 1 is in a good state and channel 2 is in a bad
state.

Betting on channel 2: For this action (denoted by $B_2$), the
system puts all the transmission power in channel 2, that is,
$P_2(t) = P$, $P_1(t) = 0$ for time slot $t$.

Note that for strategies $B_1$ and $B_2$, if a channel is not used,
the system (transmitter) will not acquire any knowledge about
the state of that channel during the elapsed slot.



C. POMDP formulation 

At the beginning of a time slot, the system is confronted
with a choice among three actions. It must judiciously select
actions so as to maximize the total expected discounted
number of bits transmitted over an infinite time span. Because
the state of the channels is not directly observable, the problem
at hand is a Partially Observable Markov Decision Process
(POMDP). In [11], it is shown that a sufficient statistic for
determining the optimal policy is the conditional probability
that the channel is in the good state at the beginning of the
current slot given the past history (henceforth called belief)
[8]. Denote the belief of the system by a two-dimensional
vector $x_t = (x_{1,t}, x_{2,t})$, where $x_{1,t} = \Pr[G_{1,t} = 1 \mid h_t]$,
$x_{2,t} = \Pr[G_{2,t} = 1 \mid h_t]$, and $h_t$ is the history of all actions and
observations at the current slot $t$. By using this belief as the
decision variable, the POMDP problem is converted into an 
MDP with the uncountable state space ([0, 1], [0, 1]) 0. 

Define a policy $\pi$ as a rule that dictates the action to
choose, i.e., a map from the belief at a particular time to
an action in the action space. Let $V^{\pi}(p)$ be the expected
discounted reward with initial belief $p = (p_1, p_2)$, that is,
$x_{1,0} = \Pr[G_{1,0} = 1 \mid h_0] = p_1$, $x_{2,0} = \Pr[G_{2,0} = 1 \mid h_0] = p_2$,
where the superscript $\pi$ denotes the policy being followed. Define
$\beta \in [0, 1)$ as the discount factor. The expected discounted
reward has the following expression:

$$V^{\pi}(p) = E_{\pi}\left[\sum_{t=0}^{\infty} \beta^t g_{a_t}(x_t) \,\middle|\, x_0 = p\right], \tag{1}$$

where $E_{\pi}$ represents the expectation given that the policy $\pi$
is employed, $t$ is the time slot index, and $a_t$ is the action chosen
at time $t$, $a_t \in \{B_b, B_1, B_2\}$. The term $g_{a_t}(x_t)$ denotes the
expected reward acquired when the belief is $x_t$ and the action
$a_t$ is chosen:

$$g_{a_t}(x_t) = \begin{cases} x_{1,t} R_l + x_{2,t} R_l, & \text{if } a_t = B_b \\ x_{1,t} R_h, & \text{if } a_t = B_1 \\ x_{2,t} R_h, & \text{if } a_t = B_2 \end{cases} \tag{2}$$
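The expected immediate reward in (2) can be illustrated with a short Python sketch. The action labels `"Bb"`, `"B1"`, `"B2"` and the function names are assumptions made for illustration; the rates $R_l = 2$ and $R_h = 3$ are the values used later in the paper's first experiment. The `myopic_action` helper only maximizes the immediate reward and is not the optimal policy studied in the paper.

```python
# Rates from the paper's first experiment; they satisfy R_L < R_H < 2 * R_L.
R_L, R_H = 2.0, 3.0

def expected_reward(action, belief):
    """Expected immediate reward g_a(x) from (2) for belief x = (x1, x2)."""
    x1, x2 = belief
    if action == "Bb":      # balanced: power P/2 on each channel
        return x1 * R_L + x2 * R_L
    if action == "B1":      # all power on channel 1
        return x1 * R_H
    if action == "B2":      # all power on channel 2
        return x2 * R_H
    raise ValueError(f"unknown action: {action}")

def myopic_action(belief):
    """Action with the largest immediate reward (ignores the value of information)."""
    return max(("Bb", "B1", "B2"), key=lambda a: expected_reward(a, belief))
```

For example, with equal beliefs of 0.5 the balanced action has the largest immediate reward, while a belief of (0.9, 0.1) favors betting on channel 1.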

Now we define the value function $V(p)$ as

$$V(p) = \max_{\pi} V^{\pi}(p), \quad \text{for all } p \in ([0,1],[0,1]). \tag{3}$$

A policy is said to be stationary if it is a function mapping the
state space $([0,1],[0,1])$ into the action space $\{B_b, B_1, B_2\}$.
Ross proved in [12] (Th. 6.3) that there exists a stationary
policy $\pi^*$ such that $V(p) = V^{\pi^*}(p)$. The value function $V(p)$
satisfies the Bellman equation

$$V(p) = \max_{a \in \{B_b, B_1, B_2\}} \{V_a(p)\}, \tag{4}$$

where $V_a(p)$ is the value acquired by taking action $a$ when
the initial belief is $p$. $V_a(p)$ is given by

$$V_a(p) = g_a(p) + \beta E_y[V(y) \mid x_0 = p, a_0 = a], \tag{5}$$

where $y$ denotes the next belief when the action $a$ is chosen
and the initial belief is $p$. The term $V_a(p)$ is explained next
for the three possible actions.



a) Balanced (action $B_b$): If this action is taken, and the
current belief is $p = (p_1, p_2)$, the immediate reward is $p_1 R_l + p_2 R_l$.
Since both channels are used, the channel quality of
both channels during the current slot is then revealed to the
transmitter. With probability $p_1$ the first channel will be in
good state and hence the belief of channel 1 at the beginning
of the next slot will be $\lambda_1$. Likewise, with probability $1 - p_1$
channel 1 will turn out to be in bad state and hence the updated
belief of channel 1 for the next slot is $\lambda_0$. Since channel 2 and
channel 1 are identical, channel 2 has a similar belief update.
Consequently, if action $B_b$ is taken, the value function evolves
as

$$\begin{aligned} V_{B_b}(p_1,p_2) = {} & p_1 R_l + p_2 R_l + \beta[(1-p_1)(1-p_2)V(\lambda_0,\lambda_0) \\ & + p_1(1-p_2)V(\lambda_1,\lambda_0) + (1-p_1)p_2 V(\lambda_0,\lambda_1) \\ & + p_1 p_2 V(\lambda_1,\lambda_1)]. \end{aligned} \tag{6}$$

b) Betting on channel 1 (action $B_1$): If this action is taken,
and the current belief is $p = (p_1, p_2)$, the immediate reward
is $p_1 R_h$. But since channel 2 is not used, its channel state
remains unknown. Hence if the belief of channel 2 during the
elapsed time slot is $p_2$, its belief at the beginning of the next
time slot is given by

$$T(p_2) = p_2\lambda_1 + (1 - p_2)\lambda_0 = \alpha p_2 + \lambda_0, \tag{7}$$

where $\alpha = \lambda_1 - \lambda_0$. Consequently, if this action is taken, the
value function evolves as

$$V_{B_1}(p_1,p_2) = p_1 R_h + \beta[(1-p_1)V(\lambda_0, T(p_2)) + p_1 V(\lambda_1, T(p_2))]. \tag{8}$$

c) Betting on channel 2 (action $B_2$): Similar to action $B_1$, if
action $B_2$ is taken, the value function evolves as

$$V_{B_2}(p_1,p_2) = p_2 R_h + \beta[(1-p_2)V(T(p_1), \lambda_0) + p_2 V(T(p_1), \lambda_1)], \tag{9}$$

where

$$T(p_1) = p_1\lambda_1 + (1 - p_1)\lambda_0 = \alpha p_1 + \lambda_0. \tag{10}$$

Finally, the Bellman equation for our power allocation problem
reads as follows:

$$V(p) = \max\{V_{B_b}(p), V_{B_1}(p), V_{B_2}(p)\}. \tag{11}$$
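As a rough numerical illustration of (6), (8), (9) and (11), the following Python sketch runs value iteration on a uniform grid over the belief square with bilinear interpolation. The grid size, the interpolation scheme, the tolerance and all function names are assumptions introduced here; the paper itself solves the problem by linear programming (Section IV). The parameter values are those of the paper's first experiment.

```python
# Parameters of the paper's first experiment.
LAM0, LAM1, BETA, R_L, R_H = 0.1, 0.9, 0.9, 2.0, 3.0
N = 11                                   # grid points per dimension (assumption)
GRID = [k / (N - 1) for k in range(N)]

def T(p):
    """Belief update (7)/(10) for a channel that was not used."""
    return p * LAM1 + (1.0 - p) * LAM0

def interp(V, p1, p2):
    """Bilinear interpolation of grid values V at belief (p1, p2)."""
    x, y = p1 * (N - 1), p2 * (N - 1)
    i, j = min(int(x), N - 2), min(int(y), N - 2)
    dx, dy = x - i, y - j
    return ((1 - dx) * (1 - dy) * V[i][j] + dx * (1 - dy) * V[i + 1][j]
            + (1 - dx) * dy * V[i][j + 1] + dx * dy * V[i + 1][j + 1])

def action_values(V, p1, p2):
    """Right-hand sides of (6), (8) and (9) for the current estimate V."""
    v = lambda a, b: interp(V, a, b)
    vb = p1 * R_L + p2 * R_L + BETA * (
        (1 - p1) * (1 - p2) * v(LAM0, LAM0) + p1 * (1 - p2) * v(LAM1, LAM0)
        + (1 - p1) * p2 * v(LAM0, LAM1) + p1 * p2 * v(LAM1, LAM1))
    t1, t2 = T(p1), T(p2)
    v1 = p1 * R_H + BETA * ((1 - p1) * v(LAM0, t2) + p1 * v(LAM1, t2))
    v2 = p2 * R_H + BETA * ((1 - p2) * v(t1, LAM0) + p2 * v(t1, LAM1))
    return {"Bb": vb, "B1": v1, "B2": v2}

def value_iteration(tol=1e-6, max_iter=2000):
    """Iterate the Bellman operator (11) on the grid until the update is below tol."""
    V = [[0.0] * N for _ in range(N)]
    for _ in range(max_iter):
        new_V = [[max(action_values(V, p1, p2).values()) for p2 in GRID]
                 for p1 in GRID]
        diff = max(abs(new_V[i][j] - V[i][j]) for i in range(N) for j in range(N))
        V = new_V
        if diff < tol:
            break
    return V

def greedy_action(V, p1, p2):
    """Action attaining the maximum in (11) for the grid estimate V."""
    values = action_values(V, p1, p2)
    return max(values, key=values.get)
```

A call such as `V = value_iteration()` followed by `greedy_action(V, p1, p2)` gives an approximate policy; the result is only as accurate as the grid and interpolation allow.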

III. Structure of the Optimal Policy 

From the above discussion we understand that an optimal 
policy exists for our power allocation problem. In this section, 
we try to derive the optimal policy by first looking at the 
features of its structure. 



A: Properties of value function

Lemma 1. $V_{B_i}(p_1,p_2)$, $i \in \{1, 2, b\}$, is affine with respect to
$p_1$ and $p_2$, and the following equalities hold:

$$\begin{aligned} V_{B_i}(cp + (1-c)p', p_2) &= cV_{B_i}(p, p_2) + (1-c)V_{B_i}(p', p_2), \\ V_{B_i}(p_1, cp + (1-c)p') &= cV_{B_i}(p_1, p) + (1-c)V_{B_i}(p_1, p'), \end{aligned} \tag{12}$$

where $0 \le c \le 1$ is a constant; and we say $f(x)$ is affine with
respect to $x$ if $f(x) = a + cx$, with constants $a$ and $c$.

Proof:

It is clear from (6) that $V_{B_b}$ is affine in $p_1$ and $p_2$. It is also
obvious from (8) and (9) that $V_{B_1}$ is affine in $p_1$ and $V_{B_2}$ is
affine in $p_2$. Next we will prove that $V_{B_1}$ is affine in $p_2$ and
$V_{B_2}$ is affine in $p_1$, which will make the proof complete.

Now we prove that $V_{B_1}$ is affine in $p_2$. We will first show
that the second term on the right side of (8) is affine in $p_2$; the
third term can then be shown to be affine in $p_2$ in a similar
manner, and thus the sum of the three terms in (8) is affine
in $p_2$.

Now let us look at the second term on the right side of
(8). The main part $V(\lambda_0, T(p_2))$ takes one of the following three
forms: $V_{B_b}(\lambda_0, T(p_2))$, $V_{B_2}(\lambda_0, T(p_2))$, or $V_{B_1}(\lambda_0, T(p_2))$.
The first form $V_{B_b}(\lambda_0, T(p_2))$ is affine in $p_2$ because
$V_{B_b}(\lambda_0, T(p_2))$ is affine in $T(p_2)$ and $T(p_2) = \alpha p_2 + \lambda_0$
is affine in $p_2$. Similarly, the second form $V_{B_2}(\lambda_0, T(p_2))$
is affine in $T(p_2)$ and thus affine in $p_2$. The third form
$V_{B_1}(\lambda_0, T(p_2))$ is written as

$$V_{B_1}(\lambda_0, T(p_2)) = \lambda_0 R_h + \beta\lambda_0 V(\lambda_1, T^2(p_2)) + \beta(1 - \lambda_0)V(\lambda_0, T^2(p_2)), \tag{13}$$

where $T^n(p) = T(T^{n-1}(p)) = \frac{\lambda_0}{1-\alpha}(1 - \alpha^n) + \alpha^n p$.
Since $T^n(p_2)$ is affine in $p_2$, (13) is affine in $p_2$ as
soon as $V(\lambda_1, T^n(p_2))$ takes the form of $V_{B_b}(\lambda_1, T^n(p_2))$
or $V_{B_2}(\lambda_1, T^n(p_2))$, and $V(\lambda_0, T^n(p_2))$ takes the form of
$V_{B_b}(\lambda_0, T^n(p_2))$ or $V_{B_2}(\lambda_0, T^n(p_2))$, $n = 2, 3, 4, \ldots$, which
is affine in $p_2$. If $V(\lambda_1, T^n(p_2))$ continues to take the form
$V_{B_1}(\lambda_1, T^n(p_2))$ as $n$ goes to infinity, $V(\lambda_1, T^n(p_2))$ will
eventually become the constant $V(\lambda_1, \frac{\lambda_0}{1-\alpha})$ because
$T^n(p_2) \to \frac{\lambda_0}{1-\alpha}$ as $n \to \infty$, which is a special case of affine
dependence on $p_2$. With this we show that the third form $V_{B_1}(\lambda_0, T(p_2))$
is affine in $p_2$. Therefore we have shown that $V(\lambda_0, T(p_2))$ is
affine in $p_2$, and thus the second term on the right side of (8) is
affine in $p_2$.

Similarly, we can show that the third term on the right side
of (8) is affine in $p_2$, and thus $V_{B_1}(p_1,p_2)$ is affine in $p_2$.

The affine dependence of $V_{B_2}(p_1,p_2)$ on $p_1$ can be proved using
the same technique, and the details are omitted due to space limits.

Lemma 2. $V_{B_i}(p_1,p_2)$, $i \in \{1, 2, b\}$, is convex in $p_1$ and $p_2$.

Proof: The convexity of $V_{B_i}$, $i \in \{1, 2, b\}$, in $p_1$ and $p_2$
follows from its affine dependence established in Lemma 1. ■

Lemma 3. $V(p_1,p_2) = V(p_2,p_1)$, that is, $V(p_1,p_2)$ is
symmetric with respect to the line $p_1 = p_2$ in the belief space.

Proof: Define $V^n(p_1,p_2)$ as the optimal value when the
decision horizon spans only $n$ stages. Then we have

$$\begin{aligned} V^1(p_1,p_2) &= \max\{V^1_{B_b}(p_1,p_2), V^1_{B_1}(p_1,p_2), V^1_{B_2}(p_1,p_2)\} \\ &= \max\{p_1 R_l + p_2 R_l,\ p_1 R_h,\ p_2 R_h\}, \end{aligned} \tag{14}$$

$$\begin{aligned} V^1(p_2,p_1) &= \max\{V^1_{B_b}(p_2,p_1), V^1_{B_1}(p_2,p_1), V^1_{B_2}(p_2,p_1)\} \\ &= \max\{p_2 R_l + p_1 R_l,\ p_2 R_h,\ p_1 R_h\}. \end{aligned} \tag{15}$$



It is easy to see that

$$V^1(p_1,p_2) = V^1(p_2,p_1). \tag{16}$$

Assume $V^k(x_1,x_2) = V^k(x_2,x_1)$, $k \ge 1$; next we will prove
that $V^{k+1}(p_1,p_2) = V^{k+1}(p_2,p_1)$.

$$\begin{aligned} V^{k+1}_{B_b}(p_1,p_2) = {} & p_1 R_l + p_2 R_l + \beta[(1-p_1)(1-p_2)V^k(\lambda_0,\lambda_0) \\ & + p_1(1-p_2)V^k(\lambda_1,\lambda_0) + (1-p_1)p_2 V^k(\lambda_0,\lambda_1) \\ & + p_1 p_2 V^k(\lambda_1,\lambda_1)], \end{aligned} \tag{17}$$

$$V^{k+1}_{B_1}(p_1,p_2) = p_1 R_h + \beta[(1-p_1)V^k(\lambda_0, T(p_2)) + p_1 V^k(\lambda_1, T(p_2))], \tag{18}$$

$$V^{k+1}_{B_2}(p_1,p_2) = p_2 R_h + \beta[(1-p_2)V^k(T(p_1), \lambda_0) + p_2 V^k(T(p_1), \lambda_1)]. \tag{19}$$

Using the assumption that $V^k(x_1,x_2) = V^k(x_2,x_1)$, it is easy
to see that

$$\begin{aligned} V^{k+1}_{B_1}(p_2,p_1) &= p_2 R_h + \beta[(1-p_2)V^k(\lambda_0, T(p_1)) + p_2 V^k(\lambda_1, T(p_1))] \\ &= p_2 R_h + \beta[(1-p_2)V^k(T(p_1), \lambda_0) + p_2 V^k(T(p_1), \lambda_1)] \\ &= V^{k+1}_{B_2}(p_1,p_2). \end{aligned} \tag{20}$$

Similarly, we have $V^{k+1}_{B_2}(p_2,p_1) = V^{k+1}_{B_1}(p_1,p_2)$ and
$V^{k+1}_{B_b}(p_2,p_1) = V^{k+1}_{B_b}(p_1,p_2)$; thus

$$\begin{aligned} V^{k+1}(p_1,p_2) &= \max\{V^{k+1}_{B_b}(p_1,p_2), V^{k+1}_{B_1}(p_1,p_2), V^{k+1}_{B_2}(p_1,p_2)\} \\ &= \max\{V^{k+1}_{B_b}(p_2,p_1), V^{k+1}_{B_2}(p_2,p_1), V^{k+1}_{B_1}(p_2,p_1)\} \\ &= V^{k+1}(p_2,p_1). \end{aligned} \tag{21}$$

From the theory of MDPs, we know that $V^n(p_1,p_2) \to V(p_1,p_2)$
as $n \to \infty$. Hence we have $V(p_1,p_2) = V(p_2,p_1)$
for any $(p_1,p_2)$ in the belief space. ■
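The induction used in the proof of Lemma 3 can be checked numerically with a small Python sketch that evaluates the finite-horizon values $V^n$ by the recursion (17)-(19), starting from $V^0 = 0$ so that $V^1$ agrees with (14). The function name `V_n`, the memoization and the parameter values (taken from the paper's first experiment) are illustrative assumptions.

```python
from functools import lru_cache
import math

# Parameters of the paper's first experiment.
LAM0, LAM1, BETA, R_L, R_H = 0.1, 0.9, 0.9, 2.0, 3.0

def T(p):
    """Belief update (7)/(10) for a channel that was not used."""
    return p * LAM1 + (1.0 - p) * LAM0

@lru_cache(maxsize=None)
def V_n(n, p1, p2):
    """Optimal n-stage value V^n(p1, p2), with V^0 = 0."""
    if n == 0:
        return 0.0
    v = lambda a, b: V_n(n - 1, a, b)
    # Balanced action, recursion (17).
    vb = p1 * R_L + p2 * R_L + BETA * (
        (1 - p1) * (1 - p2) * v(LAM0, LAM0) + p1 * (1 - p2) * v(LAM1, LAM0)
        + (1 - p1) * p2 * v(LAM0, LAM1) + p1 * p2 * v(LAM1, LAM1))
    # Betting on channel 1, recursion (18).
    v1 = p1 * R_H + BETA * ((1 - p1) * v(LAM0, T(p2)) + p1 * v(LAM1, T(p2)))
    # Betting on channel 2, recursion (19).
    v2 = p2 * R_H + BETA * ((1 - p2) * v(T(p1), LAM0) + p2 * v(T(p1), LAM1))
    return max(vb, v1, v2)

def is_symmetric(n, p1, p2):
    """Check V^n(p1, p2) = V^n(p2, p1) up to floating-point rounding."""
    return math.isclose(V_n(n, p1, p2), V_n(n, p2, p1), rel_tol=1e-9)
```

Evaluating `is_symmetric(n, p1, p2)` for a few horizons and belief pairs mirrors the statement of Lemma 3 for the finite-horizon values.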

B: Properties of the decision regions of policy $\pi^*$

We use $\Phi_a$ to denote the set of beliefs for which it is optimal
to take the action $a$. That is,

$$\Phi_a = \{(p_1,p_2) \in ([0,1],[0,1]) : V(p_1,p_2) = V_a(p_1,p_2)\}, \quad a \in \{B_b, B_1, B_2\}. \tag{22}$$

Definition 1. $\Phi_a$ is said to be contiguous along the $p_1$ dimension
if, whenever $(x_1,p_2) \in \Phi_a$ and $(x_2,p_2) \in \Phi_a$, then
$\forall x \in [x_1,x_2]$ we have $(x,p_2) \in \Phi_a$. Similarly, we say $\Phi_a$
is contiguous along the $p_2$ dimension if, whenever $(p_1,y_1) \in \Phi_a$
and $(p_1,y_2) \in \Phi_a$, then $\forall y \in [y_1,y_2]$ we have $(p_1,y) \in \Phi_a$.



Theorem 1. $\Phi_{B_b}$ is contiguous in both the $p_1$ and $p_2$ dimensions.
$\Phi_{B_1}$ is contiguous in the $p_1$ dimension, and $\Phi_{B_2}$ is contiguous in
the $p_2$ dimension.

Proof: Here we will prove the result for $\Phi_{B_1}$; the
results for $\Phi_{B_2}$ and $\Phi_{B_b}$ can be proved in a similar manner.
Let $(x_1,p_2), (x_2,p_2) \in \Phi_{B_1}$; next we show that
$(cx_1 + (1-c)x_2, p_2)$ is also in region $\Phi_{B_1}$, where $c \in [0, 1]$.

$$\begin{aligned} V(cx_1 + (1-c)x_2, p_2) &\le cV(x_1,p_2) + (1-c)V(x_2,p_2) \\ &= cV_{B_1}(x_1,p_2) + (1-c)V_{B_1}(x_2,p_2) \\ &= V_{B_1}(cx_1 + (1-c)x_2, p_2) \\ &\le V(cx_1 + (1-c)x_2, p_2), \end{aligned} \tag{23}$$

where the first inequality comes from the convexity of
$V(p_1,p_2)$ in $p_1$, the first equality follows from the fact that
$(x_1,p_2), (x_2,p_2) \in \Phi_{B_1}$, the second equality comes from the
fact that $V_{B_1}$ is linear in $p_1$ as in Lemma 1, and the last inequality
follows from the definition of $V(p_1,p_2)$. In the above equation,
we have $V(cx_1 + (1-c)x_2, p_2) = V_{B_1}(cx_1 + (1-c)x_2, p_2)$,
which means $(cx_1 + (1-c)x_2, p_2)$ is in the region $\Phi_{B_1}$;
therefore $\Phi_{B_1}$ is contiguous in the $p_1$ dimension by Definition 1.
■

Theorem 2. If belief $(p_1,p_2)$ is in $\Phi_{B_1}$, then belief $(p_2,p_1)$ is
in $\Phi_{B_2}$. In other words, the decision regions of $B_1$ and $B_2$ are
mirrors with respect to the line $p_1 = p_2$ in the belief space.

Proof: Let $(p_1,p_2)$ be a belief state in the decision region
of $B_1$; then we have

$$V(p_1,p_2) = \max\{V_{B_b}(p_1,p_2), V_{B_1}(p_1,p_2), V_{B_2}(p_1,p_2)\} = V_{B_1}(p_1,p_2). \tag{24}$$

Using equations (6), (8) and (9), we have

$$\begin{aligned} V_{B_1}(p_1,p_2) &= p_1 R_h + \beta[(1-p_1)V(\lambda_0, T(p_2)) + p_1 V(\lambda_1, T(p_2))] \\ &\ge V_{B_2}(p_1,p_2) \\ &= p_2 R_h + \beta[(1-p_2)V(T(p_1), \lambda_0) + p_2 V(T(p_1), \lambda_1)], \end{aligned} \tag{25}$$

and

$$\begin{aligned} V_{B_1}(p_1,p_2) &\ge V_{B_b}(p_1,p_2) \\ &= p_1 R_l + p_2 R_l + \beta[(1-p_1)(1-p_2)V(\lambda_0,\lambda_0) \\ &\quad + p_1(1-p_2)V(\lambda_1,\lambda_0) + (1-p_1)p_2 V(\lambda_0,\lambda_1) \\ &\quad + p_1 p_2 V(\lambda_1,\lambda_1)]. \end{aligned} \tag{26}$$

Now consider the belief state $(p_2,p_1)$:

$$\begin{aligned} V_{B_2}(p_2,p_1) &= p_1 R_h + \beta[(1-p_1)V(T(p_2), \lambda_0) + p_1 V(T(p_2), \lambda_1)] \\ &= V_{B_1}(p_1,p_2) \\ &\ge V_{B_2}(p_1,p_2) \\ &= p_2 R_h + \beta[(1-p_2)V(T(p_1), \lambda_0) + p_2 V(T(p_1), \lambda_1)] \\ &= V_{B_1}(p_2,p_1), \end{aligned} \tag{27}$$

where the second and last equalities follow by comparing with the
expressions in equation (25) and using the fact that $V(p_1,p_2) =
V(p_2,p_1)$ (Lemma 3). Similarly, from (26) and Lemma 3 we
have

$$V_{B_2}(p_2,p_1) \ge V_{B_b}(p_2,p_1). \tag{28}$$

Thus we have

$$V(p_2,p_1) = \max\{V_{B_b}(p_2,p_1), V_{B_1}(p_2,p_1), V_{B_2}(p_2,p_1)\} = V_{B_2}(p_2,p_1), \tag{29}$$

which means $(p_2,p_1)$ lies in the decision region of $B_2$, that
is, $(p_2,p_1) \in \Phi_{B_2}$. This concludes the proof. ■

Theorem 3. If belief $(p_1,p_2)$ is in $\Phi_{B_b}$, then belief $(p_2,p_1)$ is
in $\Phi_{B_b}$. That is, the decision region of $B_b$ is symmetric with
respect to the line $p_1 = p_2$ in the belief region.

Proof: Suppose $(p_1,p_2)$ is in $\Phi_{B_b}$; then we have

$$V(p_1,p_2) = \max\{V_{B_b}(p_1,p_2), V_{B_1}(p_1,p_2), V_{B_2}(p_1,p_2)\} = V_{B_b}(p_1,p_2). \tag{30}$$

Now consider the belief state $(p_2,p_1)$:

$$\begin{aligned} V_{B_b}(p_2,p_1) &= p_2 R_l + p_1 R_l + \beta[(1-p_2)(1-p_1)V(\lambda_0,\lambda_0) \\ &\quad + p_2(1-p_1)V(\lambda_1,\lambda_0) + (1-p_2)p_1 V(\lambda_0,\lambda_1) \\ &\quad + p_2 p_1 V(\lambda_1,\lambda_1)] \\ &= V_{B_b}(p_1,p_2) \\ &\ge V_{B_1}(p_1,p_2) \\ &= p_1 R_h + \beta[(1-p_1)V(\lambda_0, T(p_2)) + p_1 V(\lambda_1, T(p_2))] \\ &= V_{B_2}(p_2,p_1), \end{aligned} \tag{31}$$

where the equalities follow from (6), (8), (9) and Lemma 3.
The inequality comes from the assumption that $(p_1,p_2)$ is in
$\Phi_{B_b}$. Similarly, we have $V_{B_b}(p_2,p_1) \ge V_{B_1}(p_2,p_1)$. That is,

$$V(p_2,p_1) = \max\{V_{B_b}(p_2,p_1), V_{B_1}(p_2,p_1), V_{B_2}(p_2,p_1)\} = V_{B_b}(p_2,p_1). \tag{32}$$

That is, $(p_2,p_1)$ is in $\Phi_{B_b}$. This concludes the proof. ■

Lemma 4. After each channel is used once, the belief state
lies on the four sides of a rectangle determined by the four vertices
$(\lambda_0,\lambda_0)$, $(\lambda_0,\lambda_1)$, $(\lambda_1,\lambda_0)$, $(\lambda_1,\lambda_1)$ (Figure 1(a)).

Proof: From the belief updates in (6), (8) and (9), it is clear that
the belief state of a channel is updated to one of the following
three values after any action: $\lambda_0$, $\lambda_1$, or $T(p)$, where $p$ is the
current belief of a channel. For any $0 \le p \le 1$, $\lambda_0 \le T(p) =
\lambda_0 + (\lambda_1 - \lambda_0)p \le \lambda_1$. Therefore the belief state of a channel
is between $\lambda_0$ and $\lambda_1$.

Furthermore, since at least one channel is used in our power
allocation strategy, its channel state is revealed at the end of
the time slot. This means at least one of the channels has a
belief of either $\lambda_0$ or $\lambda_1$. This concludes the proof. ■

Fig. 1. (a) The feasible belief space, (b) The threshold on $p_1$ ($p_2 = \lambda_0$ ($\lambda_1$)).
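The belief dynamics behind Lemma 4 can be sketched in Python as follows. The simulation draws actions uniformly at random, which is a hypothetical test policy chosen only to exercise all three belief updates; the function names and the parameter values $\lambda_0 = 0.1$, $\lambda_1 = 0.9$ (from the paper's first experiment) are likewise assumptions. The helper `T_n` is the closed form of $T^n$ quoted in the proof of Lemma 1.

```python
import random

LAM0, LAM1 = 0.1, 0.9           # transition probabilities of the first experiment
ALPHA = LAM1 - LAM0

def T(p):
    """One-slot belief update (7) for a channel that was not used."""
    return p * LAM1 + (1.0 - p) * LAM0

def T_n(p, n):
    """Closed form of the n-fold update T^n(p) from the proof of Lemma 1."""
    return LAM0 / (1.0 - ALPHA) * (1.0 - ALPHA ** n) + ALPHA ** n * p

def simulate_beliefs(p0, n_slots, seed=0):
    """Belief trajectory under uniformly random actions (hypothetical policy)."""
    rng = random.Random(seed)
    belief = list(p0)
    state = [rng.random() < b for b in belief]    # hidden states in the first slot
    used_by = {"Bb": (True, True), "B1": (True, False), "B2": (False, True)}
    trace = []
    for _ in range(n_slots):
        used = used_by[rng.choice(sorted(used_by))]
        # A used channel reveals its state; an unused one is propagated by T.
        belief = [(LAM1 if state[i] else LAM0) if used[i] else T(belief[i])
                  for i in range(2)]
        # The hidden channel states evolve according to the Markov chain.
        state = [rng.random() < (LAM1 if s else LAM0) for s in state]
        trace.append(tuple(belief))
    return trace

def on_rectangle(b, tol=1e-12):
    """True if both beliefs lie in [LAM0, LAM1] and at least one is LAM0 or LAM1."""
    inside = all(LAM0 - tol <= x <= LAM1 + tol for x in b)
    return inside and any(x in (LAM0, LAM1) for x in b)
```

Every belief in a trace returned by `simulate_beliefs` should satisfy `on_rectangle`, and iterating `T` should agree with `T_n` and approach $\lambda_0/(1-\alpha)$.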

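Theorem 4. Let $p_1 \in [\lambda_0, \lambda_1]$ and $p_2 = \lambda_0$. There exists a threshold
$\rho_1$ ($\lambda_0 \le \rho_1 \le \lambda_1$) such that $\forall p_1 \in [\lambda_0, \rho_1]$,
$(p_1, \lambda_0) \in \Phi_{B_b}$ (Figure 1(b)).

Proof: We introduce the following sets:

$$\Phi_a^{p_2=\lambda_0} = \{(p_1, \lambda_0) : p_1 \in [\lambda_0, \lambda_1],\ V(p_1, \lambda_0) = V_a(p_1, \lambda_0)\}, \quad a \in \{B_b, B_1, B_2\}. \tag{33}$$

We will first prove that $\Phi_{B_b}^{p_2=\lambda_0}$ is convex, which
is important to prove the structure of the optimal policy. When
$p_2 = \lambda_0$, $V_{B_b}(p_1,p_2)$ is rewritten as

$$\begin{aligned} V_{B_b}(p_1,\lambda_0) &= p_1 R_l + \lambda_0 R_l + \beta[(1-p_1)(1-\lambda_0)V(\lambda_0,\lambda_0) \\ &\quad + p_1(1-\lambda_0)V(\lambda_1,\lambda_0) + (1-p_1)\lambda_0 V(\lambda_0,\lambda_1) \\ &\quad + p_1\lambda_0 V(\lambda_1,\lambda_1)] \\ &= p_1[R_l - \beta(1-\lambda_0)V(\lambda_0,\lambda_0) + \beta(1-\lambda_0)V(\lambda_1,\lambda_0) \\ &\quad - \beta\lambda_0 V(\lambda_0,\lambda_1) + \beta\lambda_0 V(\lambda_1,\lambda_1)] + \lambda_0 R_l \\ &\quad + \beta(1-\lambda_0)V(\lambda_0,\lambda_0) + \beta\lambda_0 V(\lambda_0,\lambda_1). \end{aligned} \tag{34}$$

From equation (34) it is easy to see that $V_{B_b}(p_1, \lambda_0)$ is linear
in $p_1$. Let $(x_1, \lambda_0), (x_2, \lambda_0) \in \Phi_{B_b}^{p_2=\lambda_0}$ and let $c \in [0, 1]$; then
we have

$$\begin{aligned} V(cx_1 + (1-c)x_2, \lambda_0) &\le cV(x_1,\lambda_0) + (1-c)V(x_2,\lambda_0) \\ &= cV_{B_b}(x_1,\lambda_0) + (1-c)V_{B_b}(x_2,\lambda_0) \\ &= V_{B_b}(cx_1 + (1-c)x_2, \lambda_0) \\ &\le V(cx_1 + (1-c)x_2, \lambda_0), \end{aligned} \tag{35}$$

where the first inequality comes from the convexity of
$V(p_1, \lambda_0)$; the first equality follows from the fact that
$(x_1, \lambda_0), (x_2, \lambda_0) \in \Phi_{B_b}^{p_2=\lambda_0}$, and the second equality from the
linearity of $V_{B_b}(p_1, \lambda_0)$; the last inequality comes from the
definition of $V(\cdot)$. Consequently, $V_{B_b}(cx_1 + (1-c)x_2, \lambda_0) =
V(cx_1 + (1-c)x_2, \lambda_0)$, hence $(cx_1 + (1-c)x_2, \lambda_0) \in \Phi_{B_b}^{p_2=\lambda_0}$,
which proves the convexity of $\Phi_{B_b}^{p_2=\lambda_0}$. Since convex subsets
of the real line are intervals and $(\lambda_0, \lambda_0) \in \Phi_{B_b}^{p_2=\lambda_0}$ (from
the fact that $\Phi_{B_b}$ is symmetric), there exists $\rho_1 \in [\lambda_0, \lambda_1]$
such that $\Phi_{B_b}^{p_2=\lambda_0} = \{(p_1, \lambda_0) : p_1 \in [\lambda_0, \rho_1]\}$ (Figure 1(b)). In other words,
$\forall p_1 \in [\lambda_0, \rho_1]$, $(p_1, \lambda_0) \in \Phi_{B_b}$. This concludes the proof.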



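Theorem 5. Let $p_1 \in [\lambda_0, \lambda_1]$ and $p_2 = \lambda_1$. There exists a threshold
$\rho_2$ ($\lambda_0 \le \rho_2 \le \lambda_1$) such that $\forall p_1 \in [\rho_2, \lambda_1]$,
$(p_1, \lambda_1) \in \Phi_{B_b}$ (Figure 1(b)).

Proof: Similar to the proof of Theorem 4, we can show that
$\Phi_{B_b}^{p_2=\lambda_1}$ is convex; therefore it is an interval on $p_2 = \lambda_1$. Now
since $(\lambda_1, \lambda_1)$ is in $\Phi_{B_b}$ from the fact that $\Phi_{B_b}$ is symmetric,
there exists a threshold $\rho_2$ such that $\forall p_1 \in [\rho_2, \lambda_1]$,
$(p_1, \lambda_1) \in \Phi_{B_b}$. ■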



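Lemma 5. In case of $p_2 = \lambda_0$, it is not optimal to take action
$B_2$. In case of $p_2 = \lambda_1$, it is not optimal to take action $B_1$.

Proof: In case of $p_2 = \lambda_0$, we need to prove that it is not
optimal to take action $B_2$, i.e.,

$$V_{B_2}(p_1, \lambda_0) < V_{B_b}(p_1, \lambda_0) \quad \text{or} \quad V_{B_2}(p_1, \lambda_0) < V_{B_1}(p_1, \lambda_0). \tag{36}$$

If either of the above inequalities holds, the proof is
complete, because $B_2$ then ranks at best second among the three
options and is therefore not optimal. We will prove the first inequality
as follows:

$$\begin{aligned} V_{B_2}(p_1, \lambda_0) = {} & R_h\lambda_0 + \beta\lambda_0 V(T(p_1), \lambda_1) + \beta(1-\lambda_0)V(T(p_1), \lambda_0), \\ V_{B_b}(p_1, \lambda_0) = {} & R_l\lambda_0 + R_l p_1 + \beta\lambda_0 p_1 V(\lambda_1, \lambda_1) \\ & + \beta(1-\lambda_0)p_1 V(\lambda_1, \lambda_0) + \beta\lambda_0(1-p_1)V(\lambda_0, \lambda_1) \\ & + \beta(1-\lambda_0)(1-p_1)V(\lambda_0, \lambda_0). \end{aligned} \tag{37}$$

Then we have:

$$\begin{aligned} V_{B_b}(p_1, \lambda_0) - V_{B_2}(p_1, \lambda_0) = {} & [R_l p_1 - (R_h - R_l)\lambda_0] \\ & + \beta\lambda_0[p_1 V(\lambda_1, \lambda_1) + (1-p_1)V(\lambda_0, \lambda_1) - V(T(p_1), \lambda_1)] \\ & + \beta(1-\lambda_0)[p_1 V(\lambda_1, \lambda_0) + (1-p_1)V(\lambda_0, \lambda_0) \\ & - V(T(p_1), \lambda_0)]. \end{aligned} \tag{38}$$

For the first term of (38) we have:

$$R_l p_1 - (R_h - R_l)\lambda_0 \ge \lambda_0[2R_l - R_h] > 0. \tag{39}$$

In the above inequality, we use the fact that $p_1 \ge \lambda_0$ and
$R_h < 2R_l$.

Assume that at point $(T(p_1), \lambda_1)$, the action $B_i$, $i \in \{1, 2, b\}$,
is optimal. Then for the second term of (38), we have:

$$\begin{aligned} & p_1 V(\lambda_1, \lambda_1) + (1-p_1)V(\lambda_0, \lambda_1) - V(T(p_1), \lambda_1) \\ &= p_1 V(\lambda_1, \lambda_1) + (1-p_1)V(\lambda_0, \lambda_1) - V_{B_i}(T(p_1), \lambda_1) \\ &\ge p_1 V_{B_i}(\lambda_1, \lambda_1) + (1-p_1)V_{B_i}(\lambda_0, \lambda_1) - V_{B_i}(T(p_1), \lambda_1) \\ &= V_{B_i}(p_1\lambda_1 + (1-p_1)\lambda_0, \lambda_1) - V_{B_i}(T(p_1), \lambda_1) \\ &= V_{B_i}(T(p_1), \lambda_1) - V_{B_i}(T(p_1), \lambda_1) = 0. \end{aligned} \tag{40}$$

The inequality above follows from the fact that $V \ge
V_{B_i}$, $i \in \{1, 2, b\}$, and the equality after the inequality is from
the linearity of $V_{B_i}$, $i \in \{1, 2, b\}$, as in Lemma 1.

Similarly, for the third term of (38), assume that at point
$(T(p_1), \lambda_0)$, the action $B_j$ is optimal. Then we have:

$$\begin{aligned} & p_1 V(\lambda_1, \lambda_0) + (1-p_1)V(\lambda_0, \lambda_0) - V(T(p_1), \lambda_0) \\ &= p_1 V(\lambda_1, \lambda_0) + (1-p_1)V(\lambda_0, \lambda_0) - V_{B_j}(T(p_1), \lambda_0) \\ &\ge p_1 V_{B_j}(\lambda_1, \lambda_0) + (1-p_1)V_{B_j}(\lambda_0, \lambda_0) - V_{B_j}(T(p_1), \lambda_0) \\ &= V_{B_j}(p_1\lambda_1 + (1-p_1)\lambda_0, \lambda_0) - V_{B_j}(T(p_1), \lambda_0) \\ &= V_{B_j}(T(p_1), \lambda_0) - V_{B_j}(T(p_1), \lambda_0) = 0. \end{aligned} \tag{41}$$

Now using (39), (40) and (41) in (38), we have:

$$V_{B_b}(p_1, \lambda_0) - V_{B_2}(p_1, \lambda_0) > 0. \tag{42}$$

(42) means that $B_2$ can never be optimal on the border
$(p_1, \lambda_0)$.

Similar arguments can be used to prove that $V_{B_1}(p_1, \lambda_1) <
V_{B_b}(p_1, \lambda_1)$; thus $B_1$ is not optimal on the border $(p_1, \lambda_1)$.
Then the proof is complete. ■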

C: The structure of the optimal policy 

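Theorem 6. The optimal policy has a simple threshold structure
and can be described as follows (Figure 2):

$$\begin{aligned} \pi^*(p_1, \lambda_0) &= \begin{cases} B_b, & \text{if } \lambda_0 \le p_1 \le \rho_1 \\ B_1, & \text{if } \rho_1 < p_1 \le \lambda_1 \end{cases} && \text{(a)} \\ \pi^*(p_1, \lambda_1) &= \begin{cases} B_b, & \text{if } \rho_2 \le p_1 \le \lambda_1 \\ B_2, & \text{if } \lambda_0 \le p_1 < \rho_2 \end{cases} && \text{(b)} \\ \pi^*(\lambda_0, p_2) &= \begin{cases} B_b, & \text{if } \lambda_0 \le p_2 \le \rho_1 \\ B_2, & \text{if } \rho_1 < p_2 \le \lambda_1 \end{cases} && \text{(c)} \\ \pi^*(\lambda_1, p_2) &= \begin{cases} B_b, & \text{if } \rho_2 \le p_2 \le \lambda_1 \\ B_1, & \text{if } \lambda_0 \le p_2 < \rho_2 \end{cases} && \text{(d)} \end{aligned} \tag{43}$$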



Proof: Let us first consider the border $(p_1, \lambda_0)$. From
Lemma 5 we understand that on this border $B_2$ is not optimal,
therefore the optimal action can only be $B_b$ or $B_1$. Furthermore,
from Theorem 4 we know that the decision region on this
border for $B_b$ is the interval $\lambda_0 \le p_1 \le \rho_1$;
it follows directly that the remaining part of this border must
belong to the decision region of $B_1$. Thus we have (43)(a).

(43)(b) specifies the optimal action on the border $(p_1, \lambda_1)$.
Similar to (43)(a), it is directly obtained from Theorem 5
and Lemma 5.

(43)(c) specifies the optimal action on the border $(\lambda_0, p_2)$.
It is directly obtained based on the result on the border
$(p_1, \lambda_0)$ (i.e., (43)(a)) using Theorems 2 and 3. Specifically,
from Theorem 3 we know that the decision region of $B_b$ is
symmetric with respect to the line $p_1 = p_2$, thus we have the
first term of (43)(c) from the first term of (43)(a). Similarly,
from Theorem 2 we know the decision regions of $B_1$ and $B_2$
are mirrors with respect to the line $p_1 = p_2$, therefore we get
the second term of (43)(c) from the second term of (43)(a).

(43)(d) specifies the optimal action on the border $(\lambda_1, p_2)$.
It is directly obtained based on the result on the border
$(p_1, \lambda_1)$ (i.e., (43)(b)) using Theorems 2 and 3. ■

Fig. 2. Structure of optimal policy.

From the above analysis we understand that the optimal
policy has a simple threshold structure, and it is critical to
find the two thresholds $\rho_1$ and $\rho_2$.

Theorem 7. Let $\delta_{i,j}(k_1,k_2) = V_{B_i}(k_1,k_2) - V_{B_j}(k_1,k_2)$
($i \in \{1,2,b\}$, $j \in \{1,2,b\}$). The threshold $\rho_1$ can be calculated as follows:

1) if $T(\lambda_0) < \rho_2$, $T(\lambda_0) < \rho_1$,

$$\rho_1 = \frac{\lambda_0 R_l + \beta\lambda_0\delta_{2,b}(\lambda_0, \lambda_1)}{R_h - R_l + \beta\lambda_0(\delta_{1,b}(\lambda_1, \lambda_1) + \delta_{2,b}(\lambda_0, \lambda_1))}, \tag{44}$$

2) if $T(\lambda_0) < \rho_2$, $T(\lambda_0) > \rho_1$,

$$\rho_1 = \frac{\lambda_0 R_l + \beta(1-\lambda_0)\delta_{b,2}(\lambda_0, \lambda_0)}{R_h - R_l + \beta\lambda_0\delta_{1,b}(\lambda_1, \lambda_1) + \beta(1-\lambda_0)\delta_{b,2}(\lambda_0, \lambda_0)}, \tag{45}$$

3) if $T(\lambda_0) > \rho_2$, $T(\lambda_0) < \rho_1$,

$$\rho_1 = \frac{\lambda_0 R_l + \beta\lambda_0\delta_{2,b}(\lambda_0, \lambda_1)}{R_h - R_l + \beta\lambda_0\delta_{2,b}(\lambda_0, \lambda_1) + \beta(1-\lambda_0)\delta_{b,1}(\lambda_1, \lambda_0)}, \tag{46}$$

4) if $T(\lambda_0) > \rho_2$, $T(\lambda_0) > \rho_1$, $\rho_1$ is calculated in (47).

Proof: We will prove (44); the rest of the theorem
can be shown in a similar manner. From Theorem 6 we know
that at the point $(\rho_1, \lambda_0)$

$$V_{B_1}(\rho_1, \lambda_0) = V_{B_b}(\rho_1, \lambda_0). \tag{48}$$

Using (8) and (6), the above is written as

$$\begin{aligned} & \rho_1 R_h + \rho_1\beta V(\lambda_1, T(\lambda_0)) + (1-\rho_1)\beta V(\lambda_0, T(\lambda_0)) \\ &= \rho_1 R_l + \lambda_0 R_l + \beta\rho_1\lambda_0 V(\lambda_1, \lambda_1) + \beta(1-\rho_1)\lambda_0 V(\lambda_0, \lambda_1) \\ &\quad + \beta\rho_1(1-\lambda_0)V(\lambda_1, \lambda_0) + \beta(1-\rho_1)(1-\lambda_0)V(\lambda_0, \lambda_0). \end{aligned} \tag{49}$$

From Theorem 6 and the condition that $T(\lambda_0) < \rho_2$,
$T(\lambda_0) < \rho_1$, we have $V(\lambda_1, T(\lambda_0)) = V_{B_1}(\lambda_1, T(\lambda_0))$,
$V(\lambda_0, T(\lambda_0)) = V_{B_b}(\lambda_0, T(\lambda_0))$ under the assumption that
$\rho_1 > T(\lambda_0)$, $V(\lambda_1, \lambda_1) = V_{B_b}(\lambda_1, \lambda_1)$, $V(\lambda_0, \lambda_1) =
V_{B_2}(\lambda_0, \lambda_1)$, $V(\lambda_1, \lambda_0) = V_{B_1}(\lambda_1, \lambda_0)$, and $V(\lambda_0, \lambda_0) =
V_{B_b}(\lambda_0, \lambda_0)$.

Thus (49) can be written as

$$\begin{aligned} & \rho_1 R_h + \rho_1\beta V_{B_1}(\lambda_1, T(\lambda_0)) + (1-\rho_1)\beta V_{B_b}(\lambda_0, T(\lambda_0)) \\ &= \rho_1 R_l + \lambda_0 R_l + \beta\rho_1\lambda_0 V_{B_b}(\lambda_1, \lambda_1) + \beta(1-\rho_1)\lambda_0 V_{B_2}(\lambda_0, \lambda_1) \\ &\quad + \beta\rho_1(1-\lambda_0)V_{B_1}(\lambda_1, \lambda_0) + \beta(1-\rho_1)(1-\lambda_0)V_{B_b}(\lambda_0, \lambda_0). \end{aligned} \tag{50}$$

Using the linearity of $V_{B_1}$ and $V_{B_b}$ in $p_2$, and the fact that
$T(\lambda_0) = \lambda_0\lambda_1 + (1-\lambda_0)\lambda_0$, we have

$$\begin{aligned} V_{B_1}(\lambda_1, T(\lambda_0)) &= \lambda_0 V_{B_1}(\lambda_1, \lambda_1) + (1-\lambda_0)V_{B_1}(\lambda_1, \lambda_0), \\ V_{B_b}(\lambda_0, T(\lambda_0)) &= \lambda_0 V_{B_b}(\lambda_0, \lambda_1) + (1-\lambda_0)V_{B_b}(\lambda_0, \lambda_0). \end{aligned} \tag{51}$$

Substituting $V_{B_1}(\lambda_1, T(\lambda_0))$ and $V_{B_b}(\lambda_0, T(\lambda_0))$ in (50) with
(51), simple manipulation yields formula (44).

Theorem 8. Let $\delta_{i,j}(k_1,k_2) = V_{B_i}(k_1,k_2) - V_{B_j}(k_1,k_2)$
($i \in \{1,2,b\}$, $j \in \{1,2,b\}$). The threshold $\rho_2$ is calculated as
follows:

1) if $T(\rho_2) > \rho_2$ and $T(\rho_2) > \rho_1$,

$$\rho_2 = \frac{\lambda_1(R_h - R_l) - \beta\lambda_1\delta_{2,b}(\lambda_0, \lambda_1) - \beta(1-\lambda_1)\delta_{b,1}(\lambda_0, \lambda_0)}{R_l - \beta\lambda_1\delta_{2,b}(\lambda_0, \lambda_1) - \beta(1-\lambda_1)\delta_{b,1}(\lambda_0, \lambda_0)}, \tag{52}$$

2) if $T(\rho_2) > \rho_2$ and $T(\rho_2) < \rho_1$,

$$\rho_2 = \frac{\lambda_1(R_h - R_l) - \beta\lambda_1\delta_{2,b}(\lambda_0, \lambda_1)}{R_l - \beta\lambda_1\delta_{2,b}(\lambda_0, \lambda_1) - \beta(1-\lambda_1)\delta_{b,1}(\lambda_1, \lambda_0)}, \tag{53}$$

3) if $T(\rho_2) < \rho_2$, $T(\rho_2) > \rho_1$,

$$\rho_2 = \frac{\lambda_1(R_h - R_l) - \beta(1-\lambda_1)\delta_{b,1}(\lambda_0, \lambda_0)}{R_l - \beta\lambda_1\delta_{2,b}(\lambda_1, \lambda_1) - \beta(1-\lambda_1)\delta_{b,1}(\lambda_0, \lambda_0)}, \tag{54}$$

4) if $T(\rho_2) < \rho_2$, $T(\rho_2) < \rho_1$,

$$\rho_2 = \frac{\lambda_1(R_h - R_l)}{R_l - \beta\lambda_1\delta_{2,b}(\lambda_1, \lambda_1) - \beta(1-\lambda_1)\delta_{b,1}(\lambda_1, \lambda_0)}. \tag{55}$$

The proof of this theorem is similar to that of Theorem 7
and is omitted here.
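Once the two thresholds are known, the border policy (43) of Theorem 6 reduces to a simple lookup, sketched below in Python. The function name `border_policy`, the string action labels, and the default values of $\lambda_0$ and $\lambda_1$ are assumptions for illustration; the thresholds must be supplied by the caller, for instance from Theorems 7 and 8 or from a numerical solution.

```python
def border_policy(p1, p2, rho1, rho2, lam0=0.1, lam1=0.9):
    """Optimal action on the borders of the feasible belief rectangle, per (43).

    rho1 and rho2 are the thresholds of Theorems 4 and 5; they are inputs
    here, not computed. Beliefs off the four borders are rejected.
    """
    if p2 == lam0:                      # border (p1, lam0), rule (43)(a)
        return "Bb" if p1 <= rho1 else "B1"
    if p2 == lam1:                      # border (p1, lam1), rule (43)(b)
        return "Bb" if p1 >= rho2 else "B2"
    if p1 == lam0:                      # border (lam0, p2), rule (43)(c)
        return "Bb" if p2 <= rho1 else "B2"
    if p1 == lam1:                      # border (lam1, p2), rule (43)(d)
        return "Bb" if p2 >= rho2 else "B1"
    raise ValueError("belief is not on a border of the feasible rectangle")
```

With hypothetical thresholds such as `rho1=0.4` and `rho2=0.6`, the call `border_policy(0.8, 0.1, 0.4, 0.6)` selects betting on channel 1, while `border_policy(0.2, 0.1, 0.4, 0.6)` selects the balanced action.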

IV. Simulation based on linear programming 

Linear programming is one of the approaches to solve the 
Bellman's equation in (HI, Based on lfT4]l . we model our 
problem as the following linear program: 

min^V(p), 
pex 

s.t. g a (p) + 0^2f a (p,y)V(y)<V(p), 

(56) 



VpeX,VaG A p , 

where $X$ is the belief space and $A_p$ is the set of available
actions in state $p$. The state transition probability $f_a(p, y)$
is the probability that the next state will be $y$ given that the
current state is $p$ and the current action is $a \in A_p$. The optimal
policy can be generated according to

$$\pi(p) = \arg\max_{a \in A_p}\Big(g_a(p) + \beta \sum_{y \in X} f_a(p,y)\,V(y)\Big). \tag{57}$$

$$\rho_1 = \frac{\lambda_0 R_l + \beta\lambda_0\,\delta_{2,1}(\lambda_0,\lambda_1) + \beta(1-\lambda_0)\,\delta_{b,1}(\lambda_0,\lambda_0)}{R_h - R_l + \beta\lambda_0\,\delta_{2,1}(\lambda_0,\lambda_1) + \beta(1-\lambda_0)\big(\delta_{b,1}(\lambda_1,\lambda_0) + \delta_{b,1}(\lambda_0,\lambda_0)\big)}. \tag{47}$$

Fig. 3. Value function.

Fig. 4. Optimal policy.

We used the LOQO solver on the NEOS Server [15] with
AMPL input [16] to obtain the solution of equation (56).
Then we used MATLAB to construct the policy according
to equation (57).
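
The linear program (56) and the policy rule (57) can be sketched on a discretized belief space. The following is a minimal illustration under stated assumptions, not the AMPL/LOQO/MATLAB workflow described above: it uses SciPy's `linprog` with the HiGHS backend instead, places an `n` by `n` grid on $[\lambda_0, \lambda_1]^2$, assumes the belief update $T(p) = \lambda_0 + (\lambda_1 - \lambda_0)p$ for an unobserved channel, and snaps $T(p)$ to the nearest grid point. The immediate rewards follow the terms appearing in (49). The function name `solve_power_allocation_lp` and the grid size are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def solve_power_allocation_lp(lam0=0.1, lam1=0.9, beta=0.9,
                              R_l=2.0, R_h=3.0, n=11):
    """Solve LP (56) on an n-by-n belief grid; actions are 0=B1, 1=B2, 2=Bb."""
    grid = np.linspace(lam0, lam1, n)
    S = n * n

    def T(p):
        # Assumed one-step belief update for a channel that is not observed.
        return lam0 + (lam1 - lam0) * p

    def snap(p):
        # Nearest grid index; an approximation introduced by discretization.
        return int(np.argmin(np.abs(grid - p)))

    def idx(i, j):
        return i * n + j

    lo, hi = 0, n - 1           # grid indices of lam0 and lam1
    G = np.zeros((S, 3))        # G[s, a] = g_a(p), expected immediate reward
    P = np.zeros((3, S, S))     # P[a, s, y] = f_a(p, y)
    for i, p1 in enumerate(grid):
        for j, p2 in enumerate(grid):
            s, t1, t2 = idx(i, j), snap(T(p1)), snap(T(p2))
            # B1: all power on channel 1; channel 1 is observed, channel 2 drifts.
            G[s, 0] = p1 * R_h
            P[0, s, idx(hi, t2)] += p1
            P[0, s, idx(lo, t2)] += 1 - p1
            # B2: the mirror image of B1.
            G[s, 1] = p2 * R_h
            P[1, s, idx(t1, hi)] += p2
            P[1, s, idx(t1, lo)] += 1 - p2
            # Bb: half power on each channel; both channels are observed.
            G[s, 2] = (p1 + p2) * R_l
            for a1, q1 in ((hi, p1), (lo, 1 - p1)):
                for a2, q2 in ((hi, p2), (lo, 1 - p2)):
                    P[2, s, idx(a1, a2)] += q1 * q2

    # Constraint g_a(p) + beta * sum_y f_a(p, y) V(y) <= V(p), rewritten
    # as (beta * P_a - I) V <= -g_a for each action a.
    A_ub = np.vstack([beta * P[a] - np.eye(S) for a in range(3)])
    b_ub = np.concatenate([-G[:, a] for a in range(3)])
    res = linprog(np.ones(S), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * S, method="highs")
    if not res.success:
        raise RuntimeError(res.message)

    V = res.x
    # Policy rule (57): pick the action with the largest one-step lookahead value.
    Q = G + beta * np.stack([P[a] @ V for a in range(3)], axis=1)
    policy = Q.argmax(axis=1).reshape(n, n)
    return grid, V.reshape(n, n), policy, Q


if __name__ == "__main__":
    grid, V, policy, _ = solve_power_allocation_lp()
    print(policy)  # rows index the belief of channel 1, columns that of channel 2
```

Because every state carries positive weight in the objective, the LP optimum coincides with the fixed point of the Bellman operator of the discretized problem. A finer grid reduces the snapping error at the cost of a larger LP.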

Figure 3 shows the AMPL solution of the value function for
the following set of parameters: $\lambda_0 = 0.1$, $\lambda_1 = 0.9$, $\beta = 0.9$,
$R_l = 2$, $R_h = 3$. The corresponding optimal policy is
shown in Figure 4. The structure of the policy in Figure 4
clearly shows the properties we gave in the preceding theorems.

In order to observe the effect of the parameters $\lambda_0$, $\lambda_1$, $R_l$ and
$R_h$ on the structure of the optimal policy, we have conducted
simulation experiments for varying parameters. First, we increase
$\lambda_0$ from 0.1 to 0.7, keeping the rest of the parameters the same
as in the above experiment. Figure 5 shows the policy structure
for different $\lambda_0$. We can observe in Figure 5 that when $\lambda_0$
increases from 0.1 to 0.3, the decision region of action $B_b$
occupies a bigger part of the belief space, whereas when $\lambda_0$
is 0.5 or greater, the whole belief space falls in the decision
region of action $B_b$, meaning that in this experiment it is
optimal to always use both channels when $\lambda_0 \ge 0.5$.

Fig. 5. Optimal policy with increasing $\lambda_0$ ($R_l = 2$, $R_h = 3$).

Fig. 6. Optimal policy with increasing $\lambda_0$ ($R_l = 2$, $R_h = 3.8$).

Intuitively, we believe the optimal policy is closely related
to the immediate rewards of the three actions. Therefore, in
the next experiment we set $R_h$ to 3.8 (so that $2 = R_l < R_h =
3.8 < 2R_l$) and repeat the above experiment; the result is
shown in Figure 6. Compared to Figure 5, the most obvious
difference is that there is no zero-threshold policy structure in
Figure 6. This is because when the immediate reward of using
one channel is greater (larger $R_h$), it is more beneficial to
use one channel than to use both channels.
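
The role of the immediate rewards can be illustrated with a small myopic comparison. This is only a sketch: it ignores the discounted future value that the optimal policy accounts for, and the function names are hypothetical. The expected one-slot rewards used here ($p_1 R_h$, $p_2 R_h$ and $(p_1 + p_2) R_l$) follow the immediate-reward terms that appear in (49).

```python
def immediate_rewards(p1, p2, R_l, R_h):
    """Expected one-slot reward of each action at belief (p1, p2)."""
    return {
        "B1": p1 * R_h,          # all power on channel 1
        "B2": p2 * R_h,          # all power on channel 2
        "Bb": (p1 + p2) * R_l,   # half power on each channel
    }


def myopic_action(p1, p2, R_l, R_h):
    """Action with the largest immediate reward (not the optimal policy)."""
    rewards = immediate_rewards(p1, p2, R_l, R_h)
    return max(rewards, key=rewards.get)


if __name__ == "__main__":
    # Same belief point, two values of R_h as in the experiments above.
    for R_h in (3.0, 3.8):
        print(R_h, myopic_action(0.9, 0.5, R_l=2.0, R_h=R_h))
```

At the belief point (0.9, 0.5) with $R_l = 2$, the balanced action has the larger immediate reward when $R_h = 3$, while betting on channel 1 has the larger immediate reward when $R_h = 3.8$, which matches the direction of the effect described above.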



Fig. 7. Optimal policy with decreasing $\lambda_1$ ($R_l = 2$, $R_h = 3.8$).

Fig. 8. Normalized thresholds $\rho_1$ and $\rho_2$ with different $\beta$ ($\lambda_0 = 0.1$, $\lambda_1 = 0.9$).



Figure 7 shows the structure of the optimal policy when $\lambda_1$
decreases from 0.9 to 0.15. The other parameters in this experiment
are: $\lambda_0 = 0.1$, $R_l = 2$, $R_h = 3$, $\beta = 0.9$. As in Figure 3, both
two-threshold and zero-threshold policies are observed in this
experiment.

From the above observations we understand that the thresholds
are sensitive to the values of $R_l$ and $R_h$. Therefore, we next
examine the relationship between the thresholds and these
values. Figure 8 shows the normalized thresholds
$\rho_1$ and $\rho_2$ versus the ratio $R_h/R_l$, for different discount
factors $\beta$, when $\lambda_0 = 0.1$, $\lambda_1 = 0.9$. $\rho_1$ is normalized as
$(\rho_1 - \lambda_0)/(\lambda_1 - \lambda_0)$, the length of the interval $[\lambda_0, \rho_1]$
relative to that of $[\lambda_0, \lambda_1]$. Similarly, the normalized $\rho_2$,
that is, $(\lambda_1 - \rho_2)/(\lambda_1 - \lambda_0)$, is the relative length of the
interval $[\rho_2, \lambda_1]$. When the normalized threshold $\rho_1$ equals 1
(in which case the normalized $\rho_2$ also equals 1), the optimal
policy has the zero-threshold structure. From Figure
8 we can also observe that the structure of the optimal policy
is affected by the value of the discount factor $\beta$.
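
The normalization used in Figure 8 can be written out as a short helper. This is a minimal sketch; the function name and the threshold values in the usage example are hypothetical and are not results from the experiments.

```python
def normalize_thresholds(rho1, rho2, lam0, lam1):
    """Return ((rho1 - lam0)/(lam1 - lam0), (lam1 - rho2)/(lam1 - lam0))."""
    width = lam1 - lam0
    if width <= 0:
        raise ValueError("expected lam0 < lam1")
    return (rho1 - lam0) / width, (lam1 - rho2) / width


if __name__ == "__main__":
    # Hypothetical two-threshold case: both normalized values lie below 1.
    print(normalize_thresholds(0.3, 0.7, lam0=0.1, lam1=0.9))
    # Zero-threshold case: rho1 reaches lam1 and rho2 reaches lam0.
    print(normalize_thresholds(0.9, 0.1, lam0=0.1, lam1=0.9))
```

In the second call both normalized thresholds equal 1, which is the zero-threshold structure described above.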

Figure 9 shows the normalized thresholds $\rho_1$ and $\rho_2$ for
different values of $\lambda_0$ and $\lambda_1$. Figure 9 gives an overall
picture of the structure of the optimal policy for different
$R_h/R_l$ and different sizes of the belief space. In all
experiments, over a wide range of parameters, no policy
structure other than the zero-threshold and two-threshold
structures is observed. We therefore conclude that, with the help
of linear-programming simulation, once the five parameters
($\lambda_0$, $\lambda_1$, $R_l$, $R_h$, $\beta$) are known, the thresholds can be derived based
on Figure 9 and the exact optimal policy can be constructed.

Fig. 9. Normalized thresholds $\rho_1$ and $\rho_2$ ($\beta = 0.8$).

V. Conclusion 

In this paper we have characterized the structure of the optimal
policy through theoretical analysis and simulation. Knowing that
this problem has a 0- or 2-threshold structure reduces the
problem of identifying optimal performance to finding the
(at most two) threshold parameters. In settings where the
underlying state transition matrices are unknown, this could
be exploited by using a multi-armed bandit (MAB) formulation
to find the best possible thresholds (similar to the ideas in
[9] and [10]). We would also like to investigate the
case of non-identical channels, and to derive useful results for
more than two channels, possibly in the form of computing the
Whittle index [17], if computing the optimal policy in general
turns out to be intractable.

Acknowledgment 

This work was done while Junhua Tang was a visiting scholar at
USC. The authors would like to thank Yi Gai and Yanting Wu
for their helpful discussions. This work is partially supported
by the Natural Science Foundation of China under grants 61071081
and 60932003. This research was also sponsored in part by
the U.S. Army Research Laboratory under the Network Science
Collaborative Technology Alliance, Agreement Number
W911NF-09-2-0053, and by the Okawa Foundation, under an
Award to support research on "Network Protocols that Learn".

References 

[1] T. Yoo and A. Goldsmith, "Capacity and power allocation for fading MIMO channels with channel estimation error," IEEE Transactions on Information Theory, vol. 52, pp. 2203-2214, May 2006.

[2] W. Yu, W. Rhee, S. Boyd, and J. M. Cioffi, "Iterative water-filling for Gaussian vector multiple-access channels," IEEE Transactions on Information Theory, vol. 50, no. 1, pp. 145-152, 2009.

[3] I. Zaidi and V. Krishnamurthy, "Stochastic adaptive multilevel water-filling in MIMO-OFDM WLANs," in 39th Asilomar Conference on Signals, Systems and Computers, 2005.

[4] X. Wang, D. Wang, H. Zhuang, and S. D. Morgera, "Energy-efficient resource allocation in wireless sensor networks over fading TDMA," IEEE Journal on Selected Areas in Communications (JSAC), vol. 28, no. 7, pp. 1063-1072, 2010.

[5] Y. Gai and B. Krishnamachari, "Online learning algorithms for stochastic water-filling," in Information Theory and Applications Workshop (ITA 2012), 2012.

[6] Q. Zhao, B. Krishnamachari, and K. Liu, "On myopic sensing for multi-channel opportunistic access: Structure, optimality, and performance," IEEE Transactions on Wireless Communications, vol. 7, no. 12, pp. 5431-5440, 2008.

[7] S. H. A. Ahmad, M. Liu, T. Javidi, Q. Zhao, and B. Krishnamachari, "Optimality of myopic sensing in multi-channel opportunistic access," IEEE Transactions on Information Theory, vol. 55, no. 9, pp. 4040-4050, 2009.

[8] A. Laourine and L. Tong, "Betting on Gilbert-Elliot channels," IEEE Transactions on Wireless Communications, vol. 9, pp. 723-733, February 2010.

[9] Y. Wu and B. Krishnamachari, "Online learning to optimize transmission over unknown Gilbert-Elliot channel," in WiOpt, 2012.

[10] N. Nayyar, Y. Gai, and B. Krishnamachari, "On a restless multi-armed bandit problem with non-identical arms," in Allerton, 2011.

[11] R. D. Smallwood and E. J. Sondik, "The optimal control of partially observable Markov processes over a finite horizon," Operations Research, vol. 21, pp. 1071-1088, September-October 1973.

[12] S. M. Ross, Applied Probability Models with Optimization Applications. San Francisco: Holden-Day, 1970.

[13] E. J. Sondik, "The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs," Operations Research, vol. 26, pp. 282-304, March/April 1978.

[14] D. P. de Farias and B. Van Roy, "The linear programming approach to approximate dynamic programming," Operations Research, vol. 51, pp. 850-865, November-December 2002.

[15] "NEOS server for optimization." http://neos.mcs.anl.gov/neos/

[16] R. Fourer, D. M. Gay, and B. W. Kernighan, AMPL: A Modeling Language for Mathematical Programming. Brooks/Cole Publishing Company, 2002.

[17] P. Whittle, "Multiarmed bandits and the Gittins index," Journal of the Royal Statistical Society, vol. 42, no. 2, pp. 143-149, 1980.



